goSLP: Globally Optimized Superword Level Parallelism Framework
Modern microprocessors are equipped with single instruction multiple data
(SIMD) or vector instruction sets which allow compilers to exploit superword
level parallelism (SLP), a type of fine-grained parallelism. Current SLP
auto-vectorization techniques use heuristics to discover vectorization
opportunities in high-level language code. These heuristics are fragile, local
and typically only present one vectorization strategy that is either accepted
or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization
framework which solves the statement packing problem in a pairwise optimal
manner. Using an integer linear programming (ILP) solver, goSLP searches the
entire space of statement packing opportunities for a whole function at a time,
while limiting total compilation time to a few minutes. Furthermore, goSLP
optimally solves the vector permutation selection problem using dynamic
programming. We implemented goSLP in the LLVM compiler infrastructure,
achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp
and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.
Comment: Published at OOPSLA 201
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM-based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.
Comment: Published at 36th International Conference on Machine Learning (ICML) 201
Dias: Dynamic Rewriting of Pandas Code
In recent years, dataframe libraries such as pandas have exploded in
popularity. Due to their flexibility, they are increasingly used in ad-hoc
exploratory data analysis (EDA) workloads. These workloads are diverse,
including custom functions which can span libraries or be written in pure
Python. The majority of systems available to accelerate EDA workloads focus on
bulk-parallel workloads, which contain vastly different computational patterns,
typically within a single library. As a result, they can introduce excessive
overheads for ad-hoc EDA workloads due to their expensive optimization
techniques. Instead, we identify program rewriting as a lightweight technique
which can offer substantial speedups while also avoiding slowdowns. We
implemented our techniques in Dias, which rewrites notebook cells to be more
efficient for ad-hoc EDA workloads. We develop techniques for efficient
rewrites in Dias, including dynamic checking of preconditions under which
rewrites are correct and just-in-time rewrites for notebook environments. We
show that Dias can rewrite individual cells to be 57× faster compared to
pandas and 1909× faster compared to optimized systems such as modin.
Furthermore, Dias can accelerate whole notebooks by up to 3.6× compared
to pandas and 26.4× compared to modin.
Comment: 16 pages, 22 figures
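The idea of a rewrite guarded by a dynamically checked precondition can be sketched in pure Python. This is only an analogue, not one of Dias's actual pandas rewrite rules: the functions `slow_top_k` and `rewritten_top_k` are hypothetical names chosen for illustration, and the pattern being rewritten (`sorted(xs)[:k]` into `heapq.nsmallest(k, xs)`) is an assumption standing in for a dataframe idiom.

```python
import heapq


def slow_top_k(xs, k):
    """Original code pattern: sort everything, then slice."""
    return sorted(xs)[:k]


def rewritten_top_k(xs, k):
    """Rewritten version with a run-time precondition check (hypothetical rule).

    heapq.nsmallest(k, xs) matches sorted(xs)[:k] only for k >= 0; a negative
    k means something different in a slice. The check below guards the fast
    path and falls back to the original code whenever it does not hold.
    """
    if isinstance(xs, list) and isinstance(k, int) and k >= 0:
        return heapq.nsmallest(k, xs)
    return slow_top_k(xs, k)  # precondition failed: run the original unchanged
```

The fallback branch is what lets such a rewrite avoid changing behavior: when the precondition cannot be established at run time, the original code runs as written.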
CoMEt: x86 Cost Model Explanation Framework
ML-based program cost models have been shown to yield highly accurate
predictions. They have the capability to replace heavily-engineered analytical
program cost models in mainstream compilers, but their black-box nature
discourages their adoption. In this work, we propose the first method for
obtaining faithful and intuitive explanations for the throughput predictions
made by ML-based cost models. We demonstrate our explanations for the
state-of-the-art ML-based cost model, Ithemal. We compare the explanations for
Ithemal with the explanations for a hand-crafted, accurate analytical model,
uiCA. Our empirical findings show that high similarity between explanations for
Ithemal and uiCA usually corresponds to high similarity between their
predictions.
FLuRKA: Fast fused Low-Rank & Kernel Attention
Many efficient approximate self-attention techniques have become prevalent
since the inception of the transformer architecture. Two popular classes of
these techniques are low-rank and kernel methods. Each of these methods has its
own strengths. We observe these strengths synergistically complement each other
and exploit these synergies to fuse low-rank and kernel methods, producing a
new class of transformers: FLuRKA (Fast Low-Rank and Kernel Attention). FLuRKA
provide sizable performance gains over these approximate techniques and are of
high quality. We theoretically and empirically evaluate both the runtime
performance and quality of FLuRKA. Our runtime analysis posits a variety of
parameter configurations where FLuRKA exhibit speedups and our accuracy
analysis bounds the error of FLuRKA with respect to full-attention. We
instantiate three FLuRKA variants which experience empirical speedups of up to
3.3x and 1.7x over low-rank and kernel methods respectively. This translates to
speedups of up to 30x over models with full-attention. With respect to model
quality, FLuRKA can match the accuracy of low-rank and kernel methods on GLUE
after pre-training on wiki-text 103. When pre-training on a fixed time budget,
FLuRKA yield better perplexity scores than models with full-attention.
Comment: 9 pages, 4 figures
Input-sensitive dense-sparse primitive compositions for GNN acceleration
Graph neural networks (GNN) have become an important class of neural network
models that have gained popularity in domains such as social and financial
network analysis. Different phases of GNN computations can be modeled using
both dense and sparse matrix operations. There have been many frameworks and
optimization techniques proposed in the literature to accelerate GNNs. However,
getting consistently high performance across many input graphs with different
sparsity patterns and GNN embedding sizes has remained difficult.
In this paper, we propose different algebraic reassociations of GNN
computations that lead to novel dense and sparse matrix primitive selections
and compositions. We show that the profitability of these compositions depends
on the input graph, embedding size, and the target hardware. We developed
SENSEi, a system that uses a data-driven adaptive strategy to select the best
composition given the input graph and GNN embedding sizes. Our evaluations on a
wide range of graphs and embedding sizes show that SENSEi achieves geomean
speedups on both graph convolutional networks and graph attention networks,
on both CPUs and GPUs, over the widely used
Deep Graph Library. Further, we show that the compositions yield notable
synergistic performance benefits on top of other established sparse
optimizations such as sparse matrix tiling by evaluating against a well-tuned
baseline.
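Why the profitability of a reassociation depends on the input can be seen with a small cost estimate. The sketch below assumes a single layer of the form A·H·W, with A a sparse n×n adjacency matrix holding `nnz` nonzeros, H a dense n×k_in feature matrix, and W a dense k_in×k_out weight matrix. It uses FLOP counts as a stand-in for cost; this is an illustrative assumption, not the data-driven selection strategy the abstract describes, and the function names are hypothetical.

```python
def gcn_layer_flops(n, nnz, k_in, k_out):
    """Approximate FLOPs for the two association orders of A @ H @ W."""
    dense_matmul = 2 * n * k_in * k_out
    return {
        # (A @ H) @ W: sparse-dense product over k_in columns, then dense matmul
        "aggregate_first": 2 * nnz * k_in + dense_matmul,
        # A @ (H @ W): dense matmul, then sparse-dense product over k_out columns
        "update_first": dense_matmul + 2 * nnz * k_out,
    }


def cheaper_order(n, nnz, k_in, k_out):
    """Pick the association order with the lower estimated FLOP count."""
    costs = gcn_layer_flops(n, nnz, k_in, k_out)
    return min(costs, key=costs.get)
```

Under this model the choice flips with the embedding sizes: shrinking layers (k_out < k_in) favor multiplying by W first, while growing layers favor aggregating first. Real performance also depends on the sparsity pattern and the hardware, which a FLOP count does not capture.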
Learning Large Graph Property Prediction via Graph Segment Training
Learning to predict properties of large graphs is challenging because each
prediction requires the knowledge of an entire graph, while the amount of
memory available during training is bounded. Here we propose Graph Segment
Training (GST), a general framework that utilizes a divide-and-conquer approach
to allow learning large graph property prediction with a constant memory
footprint. GST first divides a large graph into segments and then
backpropagates through only a few segments sampled per training iteration. We
refine the GST paradigm by introducing a historical embedding table to
efficiently obtain embeddings for segments not sampled for backpropagation. To
mitigate the staleness of historical embeddings, we design two novel
techniques. First, we finetune the prediction head to fix the input
distribution shift. Second, we introduce Stale Embedding Dropout to drop some
stale embeddings during training to reduce bias. We evaluate our complete
method GST-EFD (with all the techniques together) on two large graph property
prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD
is both memory-efficient and fast, while offering a slight boost on test
accuracy over a typical full graph training regime.
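The training loop described above can be sketched in a few lines, under heavy simplifications. Everything here is an assumption made for illustration: embeddings are plain floats, pooling is a mean, and `gst_forward`, `encode`, `table`, `num_sampled`, and `drop_prob` are hypothetical names, not the paper's API. The sketch shows only the control flow: sample a few segments to encode freshly, reuse historical embeddings for the rest, and randomly drop some stale ones.

```python
import random


def gst_forward(segments, encode, table, num_sampled, drop_prob, rng):
    """One training-style forward pass over a segmented graph (sketch).

    Returns the pooled graph embedding and the set of sampled segment indices.
    Assumes num_sampled >= 1 so that at least one embedding is pooled.
    """
    sampled = set(rng.sample(range(len(segments)), num_sampled))
    parts = []
    for i, seg in enumerate(segments):
        if i in sampled:
            emb = encode(seg)        # only these would be backpropagated through
            table[i] = emb           # refresh the historical embedding table
            parts.append(emb)
        elif i in table and rng.random() >= drop_prob:
            parts.append(table[i])   # stale embedding, reused without gradients
        # otherwise: never-seen or dropped stale embedding, left out of pooling
    return sum(parts) / len(parts), sampled
```

Because only the sampled segments are encoded per iteration, the memory needed for activations is bounded by `num_sampled` segments rather than by the size of the whole graph.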
Automatically Harnessing Sparse Acceleration
Sparse linear algebra is central to many scientific programs, yet compilers
fail to optimize it well. High-performance libraries are available, but
adoption costs are significant. Moreover, libraries tie programs into
vendor-specific software and hardware ecosystems, creating non-portable code.
In this paper, we develop a new approach based on our specification Language
for implementers of Linear Algebra Computations (LiLAC). Rather than requiring
the application developer to (re)write every program for a given library, the
burden is shifted to a one-off description by the library implementer. The
LiLAC-enabled compiler uses this to insert appropriate library routines without
source code changes.
LiLAC provides automatic data marshaling, maintaining state between calls and
minimizing data transfers. Appropriate places for library insertion are
detected in compiler intermediate representation, independent of source
languages.
We evaluated on large-scale scientific applications written in FORTRAN;
standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across
heterogeneous platforms, applications and data sets we show speedups of
1.1× to over 10× without user intervention.
Comment: Accepted to CC 202
Lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 97-100).
Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide's state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. We abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97x performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25x improvement in performance.
We manually rejuvenated Photoshop by replacing eleven of Photoshop's filters with our lifted implementations, giving 1.12× speedup without affecting the user experience.
By Thirimadura Charith Yasendra Mendis. S.M.